GH-48897: [C++] Benchmark and optimize CountSetBits#48898
|
Also @AntoinePrv FYI |
|
@ursabot please benchmark |
|
Hmm, it seems performance is behind the expected theoretical throughput. From Agner Fog's instruction tables, I see that AMD Zen 2 should be able to sustain 4 POPCNT operations/cycle (reciprocal throughput = 0.25), i.e. 32 bytes/cycle on 64-bit ints. |
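For reference, the arithmetic behind that figure can be written out as a small sketch. The constants only restate the numbers quoted above; `PeakBytesPerSecond` and the clock frequency passed to it are hypothetical, for illustration only, and are not measurements from this PR.

```cpp
#include <cstdint>

// Figure quoted above: reciprocal throughput of 0.25 cycles per POPCNT,
// i.e. 4 POPCNT instructions retired per cycle.
constexpr double kPopcntPerCycle = 4.0;

// Each POPCNT consumes one 64-bit word (8 bytes) of bitmap data.
constexpr double kBytesPerPopcnt = sizeof(uint64_t);

// Theoretical upper bound: 4 * 8 = 32 bytes of bitmap per cycle.
constexpr double kBytesPerCycle = kPopcntPerCycle * kBytesPerPopcnt;

// Hypothetical helper: converts the per-cycle bound into bytes/second for an
// assumed core clock (in Hz). Ignores memory bandwidth and loop overhead.
constexpr double PeakBytesPerSecond(double clock_hz) {
  return kBytesPerCycle * clock_hz;
}
```

Comparing a measured bytes/second figure against `PeakBytesPerSecond` for the actual core clock gives a rough idea of how far the loop is from the instruction-throughput bound.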
08383d7 to 9921e9d
|
Ok, the nested for-loop is un-nested by gcc 15.2.0... |
9921e9d to d0f45cf
|
Updated benchmark numbers after I hand-unrolled the loop. |
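To illustrate what hand-unrolling means here, below is a minimal sketch and not the code from this PR. `CountSetBitsSketch` is a hypothetical name, the unroll factor of 4 is an assumption chosen to match the 4 POPCNT/cycle figure discussed above, and `__builtin_popcountll` is the GCC/Clang builtin (other compilers would need a different intrinsic).

```cpp
#include <cstdint>
#include <cstring>  // std::memcpy

// Hypothetical helper (not Arrow's implementation): counts the set bits in
// the `length_bytes` bytes starting at `data`.
inline int64_t CountSetBitsSketch(const uint8_t* data, int64_t length_bytes) {
  constexpr int64_t kWordBytes = sizeof(uint64_t);
  constexpr int64_t kUnroll = 4;  // assumed unroll factor

  // Four independent accumulators, so that consecutive popcounts do not
  // form a single serial dependency chain.
  int64_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
  int64_t i = 0;

  // Main loop: 4 words (32 bytes) per iteration, unrolled by hand.
  for (; i + kUnroll * kWordBytes <= length_bytes; i += kUnroll * kWordBytes) {
    uint64_t w[kUnroll];
    std::memcpy(w, data + i, sizeof(w));  // unaligned-safe load
    c0 += __builtin_popcountll(w[0]);
    c1 += __builtin_popcountll(w[1]);
    c2 += __builtin_popcountll(w[2]);
    c3 += __builtin_popcountll(w[3]);
  }
  int64_t count = c0 + c1 + c2 + c3;

  // Remaining whole 64-bit words.
  for (; i + kWordBytes <= length_bytes; i += kWordBytes) {
    uint64_t w;
    std::memcpy(&w, data + i, sizeof(w));
    count += __builtin_popcountll(w);
  }

  // Trailing bytes that do not fill a whole word.
  for (; i < length_bytes; ++i) {
    count += __builtin_popcount(data[i]);
  }
  return count;
}
```

Whether the compiler emits the hardware POPCNT instruction for the builtin depends on the target flags (for example `-mpopcnt` or a suitable `-march`), so the generated assembly is worth checking before comparing against the theoretical bound.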
|
@github-actions crossbow submit -g cpp |
|
Revision: d0f45cf Submitted crossbow builds: ursacomputing/crossbow @ actions-cdbe33a753 |
|
@ursabot please benchmark |
|
@rok I have deleted the branch, so I'm not sure that can work? |
|
I see the event on Kubernetes, but the GitHub API token had expired, so it couldn't post back. |
|
Trying on #48907 |
|
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit ed35594. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 10 possible false positives for unstable benchmarks that are known to sometimes produce them. |
Rationale for this change
Counting the set bits in a null bitmap is a frequent operation, so it is useful to have a more precise idea of its performance.
What changes are included in this PR?
CountSetBits.
Local results (AMD Zen 2):
Local results (Intel(R) Core(TM) Ultra 7 255H):
Are these changes tested?
By running said benchmark manually (and by Continuous Benchmarking).
Are there any user-facing changes?
No.